MRC London Institute of Medical Sciences (http://bioinformatics.lms.mrc.ac.uk)
25/Sep/2018
data summary
hypothesis testing
correlation, linear regression and ANOVA
we only have 1 hour and 45 minutes…
All prerequisites, links to material and slides for this course can be found on github.
Or can be downloaded as a zip archive from here.
Once the zip files are unzipped, all presentations are available as HTML slides and pages. Practical sheets will be available in the directories underneath.
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived Reproducible-R folder in the Rstudio menu
Session -> Set Working Directory -> Choose Directory
or in the console.
setwd("/Users/yfwang27/workfromhome/workshop/StatisticsR_CBW18")
# e.g. setwd("~/Downloads/StatisticsInR/Statistics2016")
Start from some data for mouse strain BKS.Cg-Dock7m +/+ Leprdb/J (db/db)
Data Source: Jackson Laboratory
alldata<-read.csv(file="data/mouse_BW_Fat_Glu_info.csv")
data type
spread of data
shape of data
distribution
data type
e.g. blood pressure, body weight, height …
e.g. gender, hair colour, RNA-seq read count…
– [Data types in R]
head(): See the first 6 lines of an R object
head(alldata)
mouseID Sex Genotype Age ID BW.gram FatTissue.percent GLU.mg.dL
1 F_+/+_16_1 F +/+ 16 1 29.98 20.0 162
2 F_+/+_16_2 F +/+ 16 2 20.03 20.4 170
3 F_+/+_16_3 F +/+ 16 3 21.94 25.2 155
4 F_+/+_16_4 F +/+ 16 4 21.60 20.0 181
5 F_+/+_8_1 F +/+ 8 1 19.55 22.9 135
6 F_+/+_8_2 F +/+ 8 2 17.38 22.0 146
See how many rows and columns are in the “alldata” object
dim(alldata)
[1] 74 8
see column names of “alldata”
colnames(alldata)
[1] "mouseID" "Sex" "Genotype"
[4] "Age" "ID" "BW.gram"
[7] "FatTissue.percent" "GLU.mg.dL"
str(): Compactly display the internal structure of an R object
Make sure the data format is correct for each column.
str(alldata)
'data.frame': 74 obs. of 8 variables:
$ mouseID : Factor w/ 74 levels "F_+/+_16_1","F_+/+_16_2",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Genotype : Factor w/ 3 levels "+/+","db/+","db/db": 1 1 1 1 1 1 1 1 1 2 ...
$ Age : int 16 16 16 16 8 8 8 8 8 16 ...
$ ID : int 1 2 3 4 1 2 3 4 5 1 ...
$ BW.gram : num 30 20 21.9 21.6 19.6 ...
$ FatTissue.percent: num 20 20.4 25.2 20 22.9 22 26 25.2 21.5 33.4 ...
$ GLU.mg.dL : int 162 170 155 181 135 146 152 153 144 159 ...
summary(alldata)
mouseID Sex Genotype Age ID
F_+/+_16_1: 1 F:35 +/+ :19 Min. : 8 Min. : 1.00
F_+/+_16_2: 1 M:39 db/+ :15 1st Qu.: 8 1st Qu.: 2.00
F_+/+_16_3: 1 db/db:40 Median :12 Median : 4.00
F_+/+_16_4: 1 Mean :12 Mean : 4.23
F_+/+_8_1 : 1 3rd Qu.:16 3rd Qu.: 6.00
F_+/+_8_2 : 1 Max. :16 Max. :10.00
(Other) :68
BW.gram FatTissue.percent GLU.mg.dL
Min. :16.04 Min. :20.00 Min. :131.0
1st Qu.:26.87 1st Qu.:26.70 1st Qu.:195.8
Median :38.15 Median :54.15 Median :509.0
Mean :37.36 Mean :44.91 Mean :448.1
3rd Qu.:47.80 3rd Qu.:60.70 3rd Qu.:664.2
Max. :61.97 Max. :71.10 Max. :876.0
alldata$Age<-as.factor(alldata$Age)
summary(alldata)
mouseID Sex Genotype Age ID BW.gram
F_+/+_16_1: 1 F:35 +/+ :19 8 :37 Min. : 1.00 Min. :16.04
F_+/+_16_2: 1 M:39 db/+ :15 16:37 1st Qu.: 2.00 1st Qu.:26.87
F_+/+_16_3: 1 db/db:40 Median : 4.00 Median :38.15
F_+/+_16_4: 1 Mean : 4.23 Mean :37.36
F_+/+_8_1 : 1 3rd Qu.: 6.00 3rd Qu.:47.80
F_+/+_8_2 : 1 Max. :10.00 Max. :61.97
(Other) :68
FatTissue.percent GLU.mg.dL
Min. :20.00 Min. :131.0
1st Qu.:26.70 1st Qu.:195.8
Median :54.15 Median :509.0
Mean :44.91 Mean :448.1
3rd Qu.:60.70 3rd Qu.:664.2
Max. :71.10 Max. :876.0
ftable(): Create ‘flat’ contingency tables
ftable(alldata[,c("Age","Genotype")])
Genotype +/+ db/+ db/db
Age
8 10 7 20
16 9 8 20
library(ggplot2)
ggplot(alldata, aes(x=Genotype, y=BW.gram, fill=Age)) + geom_boxplot()
library(ggplot2)
ggplot(alldata, aes(x=Genotype, y=BW.gram, fill=Age)) +
  geom_violin(position=position_dodge(width = 0.5)) +
  geom_boxplot(width=.1, outlier.colour=NA, position=position_dodge(width = 0.5))
WT_data<-alldata[alldata$Genotype=="+/+",]
WT_data$BW.gram
[1] 29.98 20.03 21.94 21.60 19.55 17.38 19.67 18.10 16.04 27.77 30.93
[12] 29.28 26.57 29.65 22.49 23.45 24.84 23.20 23.26
min(), max(), median(), range() and quantile() functions
min.BW.gram<-min(WT_data$BW.gram)
max.BW.gram<-max(WT_data$BW.gram)
median.BW.gram<-median(WT_data$BW.gram)
mean.BW.gram<-mean(WT_data$BW.gram)
c(min.BW.gram, max.BW.gram, median.BW.gram, mean.BW.gram)
[1] 16.04000 30.93000 23.20000 23.45947
range(WT_data$BW.gram)
[1] 16.04 30.93
quantile(WT_data$BW.gram)
0% 25% 50% 75% 100%
16.04 19.85 23.20 27.17 30.93
Boxplot
c(min.BW.gram, max.BW.gram)
[1] 16.04 30.93
c(median.BW.gram, mean.BW.gram)
[1] 23.20000 23.45947
quantile(WT_data$BW.gram)[c(2,4)]
25% 75%
19.85 27.17
summary()
summary(WT_data$BW.gram)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.04 19.85 23.20 23.46 27.17 30.93
range(): show the minimum and maximum
range(WT_data$BW.gram)
[1] 16.04 30.93
IQR(): show the interquartile range, i.e. 3rd quartile - 1st quartile
IQR(WT_data$BW.gram)
[1] 7.32
Scatter plot: plot the WT mice's Body Weight against index
plot(WT_data$BW.gram,ylab="Body Weight (gram)")
sort the data from min to max
start to see something here…
hist(WT_data$BW.gram,breaks=10)
hist(WT_data$BW.gram,breaks=10,freq = F)
lines(density(WT_data$BW.gram),col="red")
\[ \begin{aligned} \overline x = \frac{{\displaystyle\sum_{i=1}^n}x_i}n \\ \\ \\ \text{Variance} = \sigma^2 = \frac{{\displaystyle\sum_{i=1}^n}(x_i-\overline x)^2}{n-1} \\ \\ \\ \text{Standard deviation} = \sigma =\sqrt{\text{Variance}} \\ \\ \\ \end{aligned} \]
var.BW.gram<-sum((WT_data$BW.gram-mean.BW.gram)^2)/(nrow(WT_data)-1)
sd.BW.gram<-sqrt(var.BW.gram)
c(var.BW.gram, sd.BW.gram)
[1] 20.53558 4.53162
var(WT_data$BW.gram)
[1] 20.53558
sd(WT_data$BW.gram)
[1] 4.53162
We use the SD more often because it has the same units as the data; but if you know one, you automatically know the other.
In many analyses, however, variances are used directly, e.g. in the F-test.
Exercise on this part can be found here
Answers can be found here
Example: flip a fair coin (50% heads, 50% tails) 10 times
\[ X \sim Binom(n,p) \]
\[ n=\text{number of trials} \\ p=\text{probability of success} \]
\[ E(X)=np \\ Var(X)=np(1-p) \\ \]
In this case:
\[ X \sim B(10,0.5) \]
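The same d/p/q/r function family used for the normal distribution below also exists for the binomial; a minimal sketch of the coin-flip example (values other than n = 10, p = 0.5 are illustrative):

```r
# X ~ Binom(n = 10, p = 0.5): flipping a fair coin 10 times
n <- 10; p <- 0.5

dbinom(5, size = n, prob = p)   # P(X = 5) = choose(10,5)/2^10, approx. 0.246
pbinom(5, size = n, prob = p)   # P(X <= 5), approx. 0.623

# E(X) = np and Var(X) = np(1-p), checked against a large simulation
set.seed(1)
x <- rbinom(100000, size = n, prob = p)
mean(x)   # close to n*p = 5
var(x)    # close to n*p*(1-p) = 2.5
```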
Example: body temperature of 150 MRC LMS staff
\[ X \sim Normal(\mu,\sigma^2) \\ \mu=mean \\ \sigma=\text{standard deviation} \]
R comes with functions for extracting information from most common distribution types. An example of standard R functions for dealing with distributions can be seen here using the normal distribution.
Similar functions are available for other distribution types, including:
We can use rnorm() function to generate random values following a normal distribution. Here we produce 10 normally distributed numeric values with mean 8 and standard deviation of 3
set.seed(2)
rnorm(10,mean=8,sd=3)
[1] 5.309256 8.554548 12.763536 4.608873 7.759245 8.397261 10.123864
[8] 7.280906 13.953422 7.583639
If you want to regenerate the exact random numbers, use the set.seed() function before generating any random numbers.
We can also use these functions to interrogate values assuming a normal distribution for the data.
The probability density at a value of 8 (for discrete distributions, the probability of exactly 8) for a distribution with mean 8 and standard deviation 3.
dnorm(8,mean=8,sd=3)
[1] 0.1329808
The probability (P(X<=x)) of a value being less than or equal to 8 for a distribution with mean 8 and standard deviation 3.
pnorm(8,mean=8,sd=3)
[1] 0.5
The value below which 50 percent of observations fall (the 0.5 quantile) for a normal distribution with mean 8 and standard deviation 3.
qnorm(0.5,mean=8,sd=3)
[1] 8
\[ X \sim Normal(\mu,\sigma^2) \]
Z-score transformation
\[ Z=\frac{X-\mu}\sigma \\ \\ Z \sim Normal(0,1) \]
x<-rnorm(10000,mean=5, sd=2.5)
hist(x)
ztransfer<-scale(x)
hist(ztransfer)
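scale() is just the z-score formula applied with the sample mean and SD; a quick sketch verifying that equivalence:

```r
set.seed(2)
x <- rnorm(10000, mean = 5, sd = 2.5)

z_manual <- (x - mean(x)) / sd(x)   # Z = (X - mean) / SD, using sample estimates
z_scale  <- as.numeric(scale(x))    # scale() centres and divides by the SD

all.equal(z_manual, z_scale)        # TRUE: the two are identical
c(mean(z_manual), sd(z_manual))     # approx. 0 and exactly 1
```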
SD (standard deviation) and SE (standard error; standard error of sample mean)
Confidence Interval (CI)
Hypothesis testing
– parametric test: e.g. t-test
– non-parametric test: e.g. Wilcoxon test; chi-square test and Fisher's exact test
Analysis of Variance (ANOVA)
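The SE and CI items above can be made concrete with a small sketch. The standard error of the sample mean is sd/sqrt(n), and a 95% CI for the mean uses the t distribution. The data here are simulated, not the mouse dataset:

```r
set.seed(42)
x <- rnorm(30, mean = 23, sd = 4.5)   # simulated body weights, in grams
n <- length(x)

# SD describes the spread of the data; SE describes the precision of the mean
se <- sd(x) / sqrt(n)

# 95% CI for the mean, using the t distribution with n - 1 df
ci <- mean(x) + qt(c(0.025, 0.975), df = n - 1) * se
ci

# t.test() reports the same interval
t.test(x)$conf.int
```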
On top of descriptive statistics, R has several statistical tests covering a range of problems and data types.
Some common tests include:
t.test()
one-sample t-test
t.test(groupA,mu=something)
independent t-test
We are going to discuss this case here.
t.test(groupA,groupB,paired=FALSE)
paired t-test
t.test(Patients_before_treatment,Patients_after_treatment,paired=TRUE)
Using the WT dataset as an example: is the body weight of WT mice different between 8 weeks and 16 weeks of age? For the purposes of this session, let's assume that mouse body weight is normally distributed in both the WT 8-week and WT 16-week groups.
WT_data<-alldata[alldata$Genotype=="+/+",]
boxplot(BW.gram~Age,data=WT_data)
Is there a difference in variances between WT at 8 weeks and WT at 16 weeks?
F test
\[ F= \frac{S^2_x}{S^2_y} \\ S^2_x:\text{ sample variance for group x} \\ S^2_y:\text{ sample variance for group y} \\\\ \text{degrees of freedom for the numerator}=n_x-1 \\ \text{degrees of freedom for the denominator}=n_y-1 \]
We can test for any differences in variances between WT 8 weeks and WT 16 weeks with an F-test using the var.test() function.
\[ H_0:\sigma_{WT_{8w}}^{2}= \sigma_{WT_{16w}}^{2} \\ H_a:\sigma_{WT_{8w}}^{2}\neq \sigma_{WT_{16w}}^{2} \]
WT_8w_data<-WT_data[WT_data$Age=="8" ,]
WT_16w_data<-WT_data[WT_data$Age=="16" ,]
var.test(WT_8w_data$BW.gram,WT_16w_data$BW.gram)
F test to compare two variances
data: WT_8w_data$BW.gram and WT_16w_data$BW.gram
F = 0.53289, num df = 9, denom df = 8, p-value = 0.3673
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1223004 2.1858962
sample estimates:
ratio of variances
0.5328912
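The F statistic reported by var.test() is simply the ratio of the two sample variances. A sketch verifying this with simulated data (the group sizes mirror the WT example, but the values are made up):

```r
set.seed(7)
x <- rnorm(10, mean = 20, sd = 2)   # stand-in for the 8-week group (n = 10)
y <- rnorm(9,  mean = 26, sd = 3)   # stand-in for the 16-week group (n = 9)
df1 <- length(x) - 1; df2 <- length(y) - 1

F_manual <- var(x) / var(y)         # F = S^2_x / S^2_y
res <- var.test(x, y)

c(F_manual, unname(res$statistic))  # identical

# two-sided p-value: twice the smaller tail of the F distribution
p_manual <- 2 * min(pf(F_manual, df1, df2),
                    pf(F_manual, df1, df2, lower.tail = FALSE))
c(p_manual, res$p.value)            # identical
```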
In R, special objects (S3 or S4 objects) can be created which have methods associated with them. The result from var.test() is an object of class htest.
Since we have not come across this before, in order to discover its structure we can use the str() function with the object of interest as the argument.
result <- var.test(WT_8w_data$BW.gram,WT_16w_data$BW.gram)
str(result)
List of 9
$ statistic : Named num 0.533
..- attr(*, "names")= chr "F"
$ parameter : Named int [1:2] 9 8
..- attr(*, "names")= chr [1:2] "num df" "denom df"
$ p.value : num 0.367
$ conf.int : num [1:2] 0.122 2.186
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 0.533
..- attr(*, "names")= chr "ratio of variances"
$ null.value : Named num 1
..- attr(*, "names")= chr "ratio of variances"
$ alternative: chr "two.sided"
$ method : chr "F test to compare two variances"
$ data.name : chr "WT_8w_data$BW.gram and WT_16w_data$BW.gram"
- attr(*, "class")= chr "htest"
Now we know the structure and class of the htest object we can access the slots containing information we want just as with a named list.
The p-value
result$p.value
[1] 0.3673436
The statistic
result$statistic
F
0.5328912
The data used in function call
result$data.name
[1] "WT_8w_data$BW.gram and WT_16w_data$BW.gram"
We have ascertained that the WT 8-week and WT 16-week groups have similar variances. We can therefore perform a standard t-test to assess the significance of the difference between these groups.
\[ H_0:\mu_{WT_{8w}}= \mu_{WT_{16w}} \\ H_a:\mu_{WT_{8w}}\neq \mu_{WT_{16w}} \]
test_res <- t.test(WT_8w_data$BW.gram,WT_16w_data$BW.gram,alternative ="two.sided", var.equal = T)
test_res
Two Sample t-test
data: WT_8w_data$BW.gram and WT_16w_data$BW.gram
t = -3.3985, df = 17, p-value = 0.003419
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-9.106749 -2.130584
sample estimates:
mean of x mean of y
20.79800 26.41667
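The t statistic can be reproduced by hand from the pooled-variance formula; a sketch with simulated groups (not the mouse data):

```r
set.seed(3)
x <- rnorm(10, mean = 20, sd = 2)
y <- rnorm(9,  mean = 26, sd = 2)
nx <- length(x); ny <- length(y)

# pooled variance, then t = (difference in means) / SE of that difference
sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1/nx + 1/ny))

c(t_manual, unname(t.test(x, y, var.equal = TRUE)$statistic))  # identical
```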
The same result as that shown above can be achieved by specifying a formula for the comparison. Here we wish to compare 8 weeks versus 16 weeks, so we can simply specify the formula and the data to be used.
result_formula <- t.test(BW.gram~Age,WT_data,alternative ="two.sided", var.equal = T)
result_formula
Two Sample t-test
data: BW.gram by Age
t = -3.3985, df = 17, p-value = 0.003419
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-9.106749 -2.130584
sample estimates:
mean in group 8 mean in group 16
20.79800 26.41667
A non-parametric statistical hypothesis test is one that does not assume a particular probability distribution for the dependent variable.
In particular, it does not require the dependent variable to be normally distributed.
wilcox.test()
The Wilcoxon Signed-Rank Test is a non-parametric statistical hypothesis test. It is a good alternative to the t-test when the dependent variable cannot be assumed to follow a normal distribution.
t.test()
one-sample t-test
t.test(groupA,mu=something)
independent t-test
t.test(groupA,groupB,paired=FALSE)
paired t-test
t.test(groupA,groupB,paired=TRUE)
wilcox.test()
one-sample Wilcoxon: Signed-Rank Test
wilcox.test(groupA,mu=something)
Wilcoxon Rank Sum Test: Mann-Whitney U
wilcox.test(groupA,groupB,paired=FALSE)
paired Wilcoxon: Signed-Rank Test
wilcox.test(groupA,groupB,paired=TRUE)
The Wilcoxon Signed-Rank Test is a non-parametric statistical hypothesis test. It is a good alternative to the t-test when the data cannot be assumed to follow a normal distribution.
\[ H_0: \text{median}_{a}- \text{median}_{b} = 0 \\ H_a: \text{median}_{a}- \text{median}_{b}\neq 0 \]
Is body weight different between WT and KO (db/db) mice?
WT_dbdb_data<-alldata[alldata$Genotype!="db/+",]
WT_dbdb_data<-droplevels(WT_dbdb_data)
boxplot(BW.gram~Genotype,data=WT_dbdb_data)
qqnorm() and qqline()
Check normal distribution with normal quantile plots for WT data
WT4wilcox<-WT_dbdb_data[WT_dbdb_data$Genotype=="+/+",]
qqnorm(WT4wilcox$BW.gram)
qqline(WT4wilcox$BW.gram)
shapiro.test(WT4wilcox$BW.gram)
Shapiro-Wilk normality test
data: WT4wilcox$BW.gram
W = 0.95302, p-value = 0.4441
qqnorm() and qqline()
Check normal distribution with normal quantile plots for db/db data
KO4wilcox<-WT_dbdb_data[WT_dbdb_data$Genotype=="db/db",]
qqnorm(KO4wilcox$BW.gram)
qqline(KO4wilcox$BW.gram)
shapiro.test(KO4wilcox$BW.gram)
Shapiro-Wilk normality test
data: KO4wilcox$BW.gram
W = 0.90465, p-value = 0.002618
wilcox.test()
wilcox.test(WT4wilcox$BW.gram,KO4wilcox$BW.gram, paired=F)
Wilcoxon rank sum test
data: WT4wilcox$BW.gram and KO4wilcox$BW.gram
W = 0, p-value = 1.431e-15
alternative hypothesis: true location shift is not equal to 0
Given two gene lists, we can test the significance of their overlap in comparison with a genomic background.
\[ H_0:\text{ the odds ratio is no larger than 1} \\ H_a:\text{ the odds ratio is larger than 1 } \]
Assume there are 20,000 genes in the mouse genome, and we have gene list A (300 genes) and gene list B (50 genes). The number of overlapping genes between lists A and B is 5. Is the overlap between the two lists significant?
Create a contingency table
cmatrix<-matrix(c(5,40,295,19960),byrow=T,ncol=2,dimnames=list(c("In.B","Not.In.B"),c("In.A","Not.In.A")))
cmatrix
In.A Not.In.A
In.B 5 40
Not.In.B 295 19960
fisher.test()
fisher.test(cmatrix)
Fisher's Exact Test for Count Data
data: cmatrix
p-value = 0.0005133
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.585939 21.637276
sample estimates:
odds ratio
8.454877
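Two caveats worth noting. First, the hypotheses stated above are one-sided, while fisher.test() defaults to a two-sided alternative. Second, the cell counts above do not quite match the stated totals (5 + 40 = 45 genes in list B rather than 50, and the cells sum to 20,300 rather than 20,000). A sketch with consistent counts, showing that the one-sided Fisher p-value equals the upper tail of the hypergeometric distribution:

```r
# consistent counts for genome = 20000, A = 300, B = 50, overlap = 5
cmatrix2 <- matrix(c(5, 45, 295, 19655), byrow = TRUE, ncol = 2,
                   dimnames = list(c("In.B", "Not.In.B"), c("In.A", "Not.In.A")))

# one-sided test, matching the H0/Ha above
one_sided <- fisher.test(cmatrix2, alternative = "greater")$p.value

# P(X >= 5) when drawing the 50 genes of list B from a genome
# containing 300 list-A genes and 19700 others
p_hyper <- phyper(5 - 1, m = 300, n = 19700, k = 50, lower.tail = FALSE)

c(one_sided, p_hyper)   # the two agree
```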
Exercise on this session can be found here
Answers can be found here
Compute analysis of variance (or deviance), a.k.a. ANOVA, for one or more fitted model objects.
ANOVA is a statistical method that uses an F-test to test
\[ H_0:\mu_{1}= \mu_{2}=... \mu_{k} \]
by comparing the variability between groups to the variability within groups
Assume that:
(1) all samples are independent, with two or more categorical groups;
(2) the dependent variable is continuous;
(3) the data in each group are normally distributed;
(4) the variances are homogeneous across groups.
boxplot(BW.gram~Genotype,data=alldata)
lmPG<-lm(formula = BW.gram ~ Genotype,data = alldata)
lmPG
Call:
lm(formula = BW.gram ~ Genotype, data = alldata)
Coefficients:
(Intercept) Genotypedb/+ Genotypedb/db
23.459 5.131 23.796
anova_PG<-anova(lmPG)
anova_PG
Analysis of Variance Table
Response: BW.gram
Df Sum Sq Mean Sq F value Pr(>F)
Genotype 2 8741.8 4370.9 101.48 < 2.2e-16 ***
Residuals 71 3058.0 43.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
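The F value in the table is the between-group mean square divided by the within-group mean square. A sketch verifying this by hand, with simulated groups whose sizes mirror the mouse data but whose values are made up:

```r
set.seed(5)
g <- factor(rep(c("wt", "het", "ko"), times = c(19, 15, 40)),
            levels = c("wt", "het", "ko"))
y <- rnorm(74, mean = c(23, 28, 47)[as.integer(g)], sd = 6.5)

grand <- mean(y)
grp_m <- tapply(y, g, mean)                 # group means, in level order

SSB <- sum(table(g) * (grp_m - grand)^2)    # between-group sum of squares
SSW <- sum((y - grp_m[as.integer(g)])^2)    # within-group (residual) SS

k <- nlevels(g); n <- length(y)
F_manual <- (SSB / (k - 1)) / (SSW / (n - k))

c(F_manual, anova(lm(y ~ g))$`F value`[1])  # identical
```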
TukeyHSD - Test which of the groups have different means
TukeyHSD(aov(lmPG))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lmPG)
$Genotype
diff lwr upr p adj
db/+-+/+ 5.131193 -0.2950406 10.55743 0.0676741
db/db-+/+ 23.796276 19.4190332 28.17352 0.0000000
db/db-db/+ 18.665083 13.9085869 23.42158 0.0000000
A common task in statistical analysis is to investigate the linear relationship between pairs of numeric vectors.
This can be done by identifying the correlation between numeric vectors using the cor() function in R.
In this example we use cor() to identify the Pearson correlation between two variables. The method argument may be set to make use of different correlation methods.
x <- rnorm(100,10,2)
z <- rnorm(100,10,2)
y <- x
cor(x,y)
[1] 1
cor(x,-y)
[1] -1
cor(x,z)
[1] -0.01102598
An example from our mouse data: we would like to see whether there is a relationship between body weight and the percentage of fat tissue in the KO (db/db) mice.
KO_data<-alldata[alldata$Genotype=="db/db",]
head(KO_data)
mouseID Sex Genotype Age ID BW.gram FatTissue.percent GLU.mg.dL
16 F_db/db_16_1 F db/db 16 1 57.09 63.1 876
17 F_db/db_16_10 F db/db 16 10 61.14 63.3 669
18 F_db/db_16_2 F db/db 16 2 49.07 69.2 830
19 F_db/db_16_3 F db/db 16 3 55.31 64.7 785
20 F_db/db_16_4 F db/db 16 4 56.39 63.2 852
21 F_db/db_16_5 F db/db 16 5 49.24 63.2 706
cor(KO_data$BW.gram,KO_data$FatTissue.percent)
[1] 0.8436125
Often we wish to apply correlation analysis to all columns or rows in a matrix in a pair-wise manner. To do this in R, we can simply pass the cor() function a single argument of the numeric matrix of interest. The cor() function will then perform all pair-wise correlations between columns.
mouse4cor<-KO_data[,c(6:8)];
cor(mouse4cor)
BW.gram FatTissue.percent GLU.mg.dL
BW.gram 1.0000000 0.8436125 0.6511112
FatTissue.percent 0.8436125 1.0000000 0.5751244
GLU.mg.dL 0.6511112 0.5751244 1.0000000
pairs(mouse4cor)
We have seen how we can find the linear correlation between two sets of variables using cor() function.
R also provides a comprehensive set of tools for regression analysis, including the widely used linear modelling function lm().
Least-squares fitting minimizes the sum of the squared vertical distances between the fitted line and the data points.
We use the KO mouse dataset as an example to see whether we can use mouse body weight to predict the percentage of fat tissue.
KO_data<-alldata[alldata$Genotype=="db/db",]
head(KO_data)
mouseID Sex Genotype Age ID BW.gram FatTissue.percent GLU.mg.dL
16 F_db/db_16_1 F db/db 16 1 57.09 63.1 876
17 F_db/db_16_10 F db/db 16 10 61.14 63.3 669
18 F_db/db_16_2 F db/db 16 2 49.07 69.2 830
19 F_db/db_16_3 F db/db 16 3 55.31 64.7 785
20 F_db/db_16_4 F db/db 16 4 56.39 63.2 852
21 F_db/db_16_5 F db/db 16 5 49.24 63.2 706
The lm() function fits a linear regression to your data and provides useful information on the generated fit.
In the example below we fit a linear model using lm() on the KO_data dataset with FatTissue.percent (Y) as the dependent variable and BW.gram (X) as the explanatory variable.
lmResult<-lm(formula = FatTissue.percent ~ BW.gram, data = KO_data)
lmResult
Call:
lm(formula = FatTissue.percent ~ BW.gram, data = KO_data)
Coefficients:
(Intercept) BW.gram
36.6140 0.5026
As we have seen, printing the model result provides the intercept and slope of the fitted line. To get more information on the model, we can use the summary() function.
summary(lmResult)
Call:
lm(formula = FatTissue.percent ~ BW.gram, data = KO_data)
Residuals:
Min 1Q Median 3Q Max
-4.4038 -1.5671 -0.0821 1.1951 7.9231
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.6140 2.4861 14.727 < 2e-16 ***
BW.gram 0.5026 0.0519 9.685 8.25e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.583 on 38 degrees of freedom
Multiple R-squared: 0.7117, Adjusted R-squared: 0.7041
F-statistic: 93.8 on 1 and 38 DF, p-value: 8.247e-12
lmResult$coefficients
(Intercept) BW.gram
36.6140076 0.5026053
From the $coefficients of object lmResult, we know the equation for the best fit is
\[ Y = 36.6140076 + 0.5026053 *X \]
\[ f(x) = b_0 + b_1x \]
\[ b_0\text{: the value of f(x) when x =0} \]
# the intercept 36.6140076 is the expected percentage of fat tissue at a body weight of 0 grams
# not meaningful for any biological question
\[ b_1\text{: the amount of f(x) will change when x changes 1 unit} \]
# for every gram increase in mouse body weight, we expect a 0.50 (%) increase in fat tissue
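The least-squares estimates can also be recovered from the closed-form formulas b1 = cov(x, y)/var(x) and b0 = mean(y) - b1*mean(x); a sketch with simulated data standing in for KO_data:

```r
set.seed(9)
bw  <- rnorm(40, mean = 50, sd = 5)            # simulated body weights
fat <- 36.6 + 0.5 * bw + rnorm(40, sd = 2.6)   # simulated fat %, linear + noise

b1 <- cov(bw, fat) / var(bw)      # slope
b0 <- mean(fat) - b1 * mean(bw)   # intercept

c(b0, b1)
coef(lm(fat ~ bw))                # the same estimates
```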
Predict the percentage of fat tissue with the body weight information.
If we have 3 KO mice with weight = 40, 55 and 66 grams, how do we predict their percentage of fat tissue?
Use the information from the $coefficients
new_mouse_bw<-c(40,55,66)
beta0<-lmResult$coefficients[1]
beta1<-lmResult$coefficients[2]
predicted_new_fat<-beta0+beta1*new_mouse_bw
predicted_new_fat
[1] 56.71822 64.25730 69.78596
Or use the predict() function
new_mouse_bw_df <- data.frame(BW.gram=c(40,55,66))
cleaver_predicted_fat<-predict(lmResult,new_mouse_bw_df)
cleaver_predicted_fat
1 2 3
56.71822 64.25730 69.78596
The residuals are the difference between the predicted and actual values. To retrieve the residuals we can access the slot or use the resid() function.
summary(resid(lmResult))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.40385 -1.56712 -0.08214 0.00000 1.19506 7.92315
summary(lmResult$residual)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.40385 -1.56712 -0.08214 0.00000 1.19506 7.92315
Ideally you would want your residuals to be normally distributed around 0.
\[ E[e_{i}]=0 \]
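When the model includes an intercept, the least-squares residuals sum to exactly zero by construction, so only their spread and shape need checking. A quick sketch with simulated data:

```r
set.seed(11)
x <- rnorm(40, mean = 50, sd = 5)
y <- 36 + 0.5 * x + rnorm(40, sd = 2.5)
fit <- lm(y ~ x)

mean(resid(fit))      # zero, up to floating-point error
sum(resid(fit))       # likewise zero

summary(resid(fit))   # residuals spread around 0
```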
Plot the residuals
plot(KO_data$BW.gram,KO_data$FatTissue.percent,ylim=c(50,75),
ylab="Fat tissue (%)",xlab="body weight (gram)")
abline(lmResult,col="blueviolet",lwd=3, lty=1)
Residual is the vertical distance between the observed data and the regression line. It has the same unit as the dependent variable.
SSE shows the residual variability
It shows the variability that cannot be explained by the regression model
\[ Error_i = y_i - \hat{y_i} \\ y_i\text{: the observed fat tissue percentage of the ith mouse} \\ \hat{y_i}\text{: the predicted fat tissue percentage of the ith mouse} \\ Error_i^2 = (y_i - \hat{y_i})^2 \\ \text{- sum of the square of the residuals (SSE)} \\ SSE = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 \]
Plot the residuals against the independent variable (X), i.e. the body weight. This makes the residuals easier to assess by eye.
plot(KO_data$BW.gram,lmResult$residual,ylim=c(-10,10),
ylab="residuals (fat tissue %)",xlab="body weight (gram)")
abline(h=0,col="blueviolet",lwd=3, lty=1)
The R-squared value represents the proportion of variability in the response variable that is explained by the explanatory variable.
A high R-squared here indicates that the line fits closely to the data.
summary(lmResult)$r.squared
[1] 0.711682
KO_data$FatTissue.percent
[1] 63.1 63.3 69.2 64.7 63.2 63.2 71.1 62.7 67.0 63.7 57.7 54.8 56.8 53.5
[15] 60.1 55.1 55.4 57.8 56.2 57.3 59.7 64.2 62.9 58.9 63.8 64.8 67.5 65.8
[29] 64.5 64.4 56.1 54.3 56.9 58.1 53.4 60.9 58.4 54.7 54.0 55.4
Question: How would you describe (or summarize) the percentage of fat tissue when the body weight information is absent? What would you use to predict a new mouse's percentage of fat tissue?
mean might be a good choice
mean(KO_data$FatTissue.percent)
[1] 60.365
Residuals from the mean: assuming the independent variable (X), i.e. body weight in our case, does not exist
\[ TSS=\text{Total Sum of Squares}=\sum_{i=1}^n(y_i-\overline y)^2 \]
Residuals from the model
The fraction of variability in the dependent variable (Y; the fat tissue percentage in this example) that can be explained by the explanatory variable (X; the body weight in this example).
\[ TSS=\text{Total Sum of Squares}=\sum_{i=1}^n(y_i-\overline y)^2 \\ SSE=\text{Sum of the Square of the residuals}=\sum_{i=1}^n(y_i-\hat{y})^2 \]
SSE<-sum(resid(lm(FatTissue.percent~BW.gram,data=KO_data))^2)
TSS<-sum(resid(lm(FatTissue.percent~1,data=KO_data))^2)
R_square<-1-(SSE/TSS)
R_square
[1] 0.711682
summary(lmResult)$r.squared
[1] 0.711682
The R-squared shows the fraction of the total variability that is explained by the linear relationship with the explanatory variable. However, it does not provide a formal hypothesis test for this relationship.
The F-test from the linear model provides such a test: it assesses whether the explanatory variable explains more variability than would be expected by chance.
summary(lmResult)$fstatistic
value numdf dendf
93.79893 1.00000 38.00000
\[ F=\frac{MSM}{MSE}=\frac{\text{mean of the explained variance}}{\text{mean of the unexplained variance}}=\frac{({\displaystyle\frac{SSM}1})}{({\displaystyle\frac{SSE}{n-2}})} \]
n=nrow(KO_data)
SSM <- sum((predict(lmResult) - mean(KO_data$FatTissue.percent))^2)
MSE <-sum(lmResult$residuals^2)/(n-2)
MSM <-SSM/1
MSM/MSE
[1] 93.79893
summary(lmResult)$fstatistic
value numdf dendf
93.79893 1.00000 38.00000
Exercise on this session can be found here
Answers can be found here